People are generally poor at interpreting raw numerical data but can process visual information remarkably quickly, which is quite the opposite of computers. For that reason you will nearly always want some sort of visual to accompany your analysis. In this exercise, we'll be using Matplotlib, a plotting package in the SciPy ecosystem with MATLAB-like syntax, to generate a variety of plots.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
x = np.linspace(-2*np.pi, 2*np.pi, 500)
y1 = np.sin(x)
y2 = np.cos(x)
2 - Using the default settings, use pyplot to plot $y_1$ and $y_2$ versus $x$, all on the same plot.
In [3]:
plt.plot(x, y1)
plt.plot(x, y2);
3 - Generate the same plots, but set the horizontal and vertical limits to be slightly smaller than the default settings. In other words, tighten up the plot a bit.
In [4]:
# Switching to explicit plot
fig = plt.figure()
ax = plt.axes()
ax.plot(x, y1)
ax.plot(x, y2)
# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1);
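The cell above moves from the implicit `plt.plot` calls to an explicit figure and axes. As a minimal sketch of the same idea, `plt.subplots()` creates both objects in one call and `ax.set()` applies several properties at once. The data is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2 * np.pi, 2 * np.pi, 500)

# plt.subplots() returns the figure and its axes together
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.plot(x, np.cos(x))

# ax.set() accepts several axes properties as keyword arguments
ax.set(xlim=(-2 * np.pi, 2 * np.pi), ylim=(-1, 1))
```

Either style gives the same figure; the explicit `ax` handle simply makes it clearer which axes each setting applies to.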
4 - Generate the same plots using all settings from above, but now change the color and thickness of each from the defaults. Play around with the values a bit until you are satisfied with how they look.
In [5]:
fig = plt.figure()
ax = plt.axes()
# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3)
ax.plot(x, y2, c='cyan', linewidth=3)
# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1);
5 - Generate the same plots using all settings from above, but now add some custom tickmarks with labels of your choosing. Which values would make sense given the functions we are using?
In [6]:
fig = plt.figure()
ax = plt.axes()
# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3)
ax.plot(x, y2, c='cyan', linewidth=3)
# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1)
# Set locators at multiples of π/2 and the respective labels using a list
# (raw strings keep the LaTeX backslashes from being read as escape sequences)
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']));
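`FixedFormatter` assigns labels by tick position rather than by tick value, which is presumably why the list above starts with an empty string: it absorbs a tick that the locator places just outside the visible range. A sketch of an alternative is to compute each label from the tick value with a `FuncFormatter`. The helper name `pi_label` is hypothetical and not part of the original solution.

```python
from fractions import Fraction

import numpy as np
import matplotlib.pyplot as plt


def pi_label(value, pos):
    """Hypothetical helper: label a tick as a multiple of pi/2."""
    # Express value / pi as an exact fraction with denominator 1 or 2
    frac = Fraction(int(round(value / (np.pi / 2))), 2)
    if frac == 0:
        return '$0$'
    sign = '-' if frac < 0 else ''
    num, den = abs(frac.numerator), frac.denominator
    coeff = '' if num == 1 else str(num)
    tail = '' if den == 1 else f'/{den}'
    return rf'${sign}{coeff}\pi{tail}$'


x = np.linspace(-2 * np.pi, 2 * np.pi, 500)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlim(-2 * np.pi, 2 * np.pi)
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FuncFormatter(pi_label))
```

Because the label is derived from the value, it stays correct if the axis limits change, at the cost of a few more lines than the fixed list.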
6 - Generate the same plots using all the settings from above, but now change your plot spines so that they are centered at the origin. In other words, change the plot area from a "box" to a "cross".
In [7]:
fig = plt.figure()
ax = plt.axes()
# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3, alpha=.5)
ax.plot(x, y2, c='cyan', linewidth=3, alpha=.5)
# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1)
# Set locators at multiples of π/2 and the respective labels using a list
# (raw strings keep the LaTeX backslashes from being read as escape sequences)
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']))
# Move the left and bottom spines to the center and make the other two invisible
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('None')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('None');
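In the cell above, `'center'` places each spine in the middle of the axes area. That coincides with the origin only because both axis limits are symmetric around zero. As a sketch under that observation, `'zero'` pins the spines to data coordinate 0 instead, so the cross stays at the origin even with asymmetric limits. The asymmetric range below is chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Deliberately asymmetric x range, for illustration only
x = np.linspace(-np.pi, 3 * np.pi, 500)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlim(-np.pi, 3 * np.pi)

# 'zero' pins each spine to data coordinate 0 on the other axis
ax.spines['left'].set_position('zero')
ax.spines['bottom'].set_position('zero')
# Hide the two remaining spines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
```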
7 - Generate the same plots using all the settings from above, but now add a legend, with labels sine and cosine, to your plot in a position of your choosing.
In [8]:
fig = plt.figure()
ax = plt.axes()
# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3, alpha=.5, label='sine')
ax.plot(x, y2, c='cyan', linewidth=3, alpha=.5, label='cosine')
# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1)
# Set locators at multiples of π/2 and the respective labels using a list
# (raw strings keep the LaTeX backslashes from being read as escape sequences)
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']))
# Move the left and bottom spines to the center and make the other two invisible
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('None')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('None')
# Add the legend in the lower left corner
ax.legend(loc='lower left');
8 - Now generate two more data sets, $$y_3 = \sin(x) + \sin(2x)$$ $$y_4 = \cos(x) + \cos(2x)$$ and add them to your plot, setting different color and line styles (for example, dotted). Be sure to adjust your scales and legend as needed. Also add a title to your plot.
In [9]:
y3 = np.sin(x) + np.sin(2*x)
y4 = np.cos(x) + np.cos(2*x)
In [10]:
fig = plt.figure(figsize=(8,8))
ax = plt.axes()
# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3, alpha=.5, label=r'$\sin(x)$')
ax.plot(x, y2, c='cyan', linewidth=3, alpha=.5, label=r'$\cos(x)$')
# Add the new functions
ax.plot(x, y3, c='red', linewidth=3, alpha=.5, label=r'$\sin(x) + \sin(2x)$')
ax.plot(x, y4, c='blue', linewidth=3, alpha=.5, label=r'$\cos(x) + \cos(2x)$')
# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-3, 3)
# Set locators at multiples of π/2 and the respective labels using a list
# (raw strings keep the LaTeX backslashes from being read as escape sequences)
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']))
# Move the left and bottom spines to the center and make the other two invisible
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('None')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('None')
# Add the legend in the lower left corner
ax.legend(loc='lower left', frameon=False)
# Set the title
ax.set_title('Some trigonometric functions');
In this exercise we'll be using a real data set to test out the functionality of matplotlib.
1 - Go to the R Data Repository and download, or load directly, the Aircraft Crash data, load it into a Data Frame, and print the first few rows.
In [11]:
crash = pd.read_csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/gamclass/airAccs.csv')
In [12]:
# Change the first column name to 'id'
crash = crash.rename(columns={crash.columns[0]: 'id'})
crash.head()
Out[12]:
2 - Generate a histogram for the number of deaths, using bin sizes of your choice. Be sure to adjust the axis and to add a title to make your plot aesthetically appealing.
First, let's take a look at the summary statistics of the data:
In [13]:
crash.describe()
Out[13]:
The column containing the number of deaths seems to be skewed to the right, so we expect a plot with some isolated bars to the right:
In [14]:
# 'seaborn-v0_8-white' is the Matplotlib 3.6+ name for the former 'seaborn-white' style
with plt.style.context('seaborn-v0_8-white'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # drop NaNs because they cause an error with the hist command
    n, bins, patches = ax.hist(crash['Dead'].dropna(), bins=50)
    # set x axis limits
    ax.set_xlim(0,600)
    # adjust the number of ticks
    ax.xaxis.set_major_locator(plt.MaxNLocator(50))
    # add a title to the plot and a label to the x axis
    ax.set_title('Deaths in plane crashes')
    ax.set_xlabel('Number of deaths')
Indeed, there are many bars on the right that are almost invisible. Let's look at the bins calculated by the hist command to see the values for those bars:
In [15]:
print(bins)
print(n)
As you can see, two thirds of the bins (approximately the ones over 180 deaths) have a count of zero or one. Combined with the much larger counts in the first bins, this makes the last bars invisible.
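To make the effect concrete without downloading the data, here is a small sketch on synthetic right-skewed values (a stand-in, not the real `Dead` column). `np.histogram` is the function `ax.hist` uses internally for binning, so it gives comparable `counts` and `edges`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed stand-in for the 'Dead' column (not the real data)
deaths = np.append(rng.exponential(scale=20, size=5000), [300, 450, 580])

counts, edges = np.histogram(deaths, bins=50)

# Bins holding zero or one value are effectively invisible next to the tallest bar
sparse = counts <= 1
print(f'{sparse.sum()} of {counts.size} bins hold at most one value')
print(f'a count of 1 is {1 / counts.max():.2%} of the tallest bar')
```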
We can try to solve this in a few ways. First, let's try cutting the y axis and using a different scale for the upper part of the taller bars:
In [16]:
# 'seaborn-v0_8-white' is the Matplotlib 3.6+ name for the former 'seaborn-white' style
with plt.style.context('seaborn-v0_8-white'):
    # create two subplots, one for the higher bars and the other for the lower parts of every bar
    fig, ax = plt.subplots(2, 1, sharex='col', figsize=(16,8))
    # plot the lower part of bars by setting the limit of the y axis to 90
    ax[1].hist(crash['Dead'].dropna(), bins=50)
    ax[1].set_xlim(0,600)
    ax[1].set_ylim(0,90)
    ax[1].xaxis.set_major_locator(plt.MaxNLocator(50))
    # plot the higher part of bars by setting the limit of the y axis from 100 to 3500
    ax[0].hist(crash['Dead'].dropna(), bins=50)
    ax[0].set_ylim(100,3500)
    # add title and x axis label
    ax[0].set_title('Deaths in plane crashes')
    ax[1].set_xlabel('Number of deaths')
    # delete the spines between the plots
    ax[0].spines['bottom'].set_color('None')
    ax[1].spines['top'].set_color('None')
    # add dashes to indicate the cut in the y axis (shamelessly copying code from Stack Overflow!)
    d = .01
    kwargs = dict(transform=ax[0].transAxes, color='k', clip_on=False)
    ax[0].plot((-d,+d),(-d,+d), **kwargs)
    ax[0].plot((1-d,1+d),(-d,+d), **kwargs)
    kwargs.update(transform=ax[1].transAxes)
    ax[1].plot((-d,+d),(1-d,1+d), **kwargs)
    ax[1].plot((1-d,1+d),(1-d,1+d), **kwargs)
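A lighter-weight option than the broken axis, not used in the original solution, is a logarithmic y axis, which keeps bars with a count of one visible next to bars with counts in the thousands. The sketch below uses synthetic values rather than the crash data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic right-skewed stand-in for the 'Dead' column (not the real data)
deaths = np.append(rng.exponential(scale=20, size=5000), [300, 450, 580])

fig, ax = plt.subplots(figsize=(8, 4))
counts, edges, patches = ax.hist(deaths, bins=50)
# Switch the y axis to a log scale; empty bins are simply not drawn
ax.set_yscale('log')
ax.set_xlabel('Number of deaths')
ax.set_ylabel('Number of crashes (log scale)')
```

The trade-off is that bar heights are no longer proportional to counts, so the axis label should say that the scale is logarithmic.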
Another possible solution is to clip the values and use the last bin to represent all the values over a certain number of deaths:
In [17]:
# Function for formatting the label of the clipped bin
def hist_formatter(value, pos):
    if value == 100:
        return ''
    elif value == 98:
        return str(int(value)) + '+'
    else:
        return str(int(value))
In [18]:
# 'seaborn-v0_8-white' is the Matplotlib 3.6+ name for the former 'seaborn-white' style
with plt.style.context('seaborn-v0_8-white'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # clip the values over 100 and create 50 bins
    ax.hist(np.clip(crash['Dead'].dropna(), 0, 100), bins=50)
    # set x axis limit to 100
    ax.set_xlim(0,100)
    # add 50 ticks (start and end of the bar)
    ax.xaxis.set_major_locator(plt.MaxNLocator(50))
    # format ticks so that the last bar is labeled 98+
    ax.xaxis.set_major_formatter(plt.FuncFormatter(hist_formatter))
    # add title and x axis label
    ax.set_title('Deaths in plane crashes')
    ax.set_xlabel('Number of deaths')
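The clipping step is what makes the last bar stand for "this value or more". A tiny sketch with made-up numbers shows how `np.clip` folds every value above the limit into the final bin.

```python
import numpy as np

# Made-up death counts, three of them above the clipping limit
deaths = np.array([1, 5, 20, 150, 300, 520])

# Everything above 100 is replaced by 100
clipped = np.clip(deaths, 0, 100)

# Five equal-width bins on [0, 100]; the last one collects the clipped values
counts, edges = np.histogram(clipped, bins=5, range=(0, 100))
print(clipped)  # [  1   5  20 100 100 100]
print(counts)   # [2 1 0 0 3]
```

The tall final bin is therefore an artifact of the clipping, which is why the tick formatter labels it with a trailing `+`.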
3 - Make some plots of total number of deaths with respect to time, making use of Pandas time series functionality. Again, be sure to make your plot aesthetically appealing.
First, let's convert the Date column to datetime format and set it as the index of the DataFrame:
In [20]:
# pd.datetime was removed in pandas 2.0; pd.to_datetime parses the whole column at once
crash['Date'] = pd.to_datetime(crash['Date'], format='%Y-%m-%d')
crash.set_index(['Date'], inplace=True)
Now let's create a new DataFrame containing yearly aggregates of the data:
In [21]:
# Resample at the year start taking the sum and using 0 where the sum is NaN
# ('YS' is the year-start alias; numeric_only=True skips the text columns)
yearly_crash = crash.resample('YS').sum(numeric_only=True).fillna(0)
yearly_crash.head()
Out[21]:
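As a minimal sketch of what the yearly resampling does, here is the same operation on a handful of made-up records. Note that a year with no rows still appears in the result, with a sum of zero.

```python
import pandas as pd

# Made-up crash records indexed by date (not the real data)
toy = pd.DataFrame(
    {'Dead': [10, 520, 3, 7]},
    index=pd.to_datetime(['1985-03-01', '1985-08-12', '1986-01-05', '1988-07-01']),
)

# Group the rows by calendar year, labelling each group with January 1st
yearly = toy.resample('YS').sum()
print(yearly)

# A DatetimeIndex also allows selecting a whole year with a string
rows_1985 = toy.loc['1985']
```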
Last but not least, the plot:
In [22]:
# 'seaborn-v0_8' is the Matplotlib 3.6+ name for the former 'seaborn' style
with plt.style.context('seaborn-v0_8'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # Plot the Dead column
    ax.plot(yearly_crash['Dead'])
    # Add a title
    ax.set_title('Deaths in plane crashes by year')
    # Set limits for the axis
    ax.set_xlim(yearly_crash.index.values.min(), yearly_crash.index.values.max())
    ax.set_ylim(-100, 3200)
    # Increase the number of ticks
    ax.xaxis.set_major_locator(plt.MaxNLocator(20));
There are two peaks of over 2500 deaths in a year, roughly between 1972 and 1985. Let's determine exactly which years they are:
In [23]:
yearly_crash[yearly_crash['Dead'] > 2500]
Out[23]:
The years are 1972 and 1985. Let's take a closer look at them:
In [24]:
crash.loc['1972'].describe()
Out[24]:
In [25]:
crash.loc['1972'].sort_values('Dead', ascending=False).head(10)
Out[25]:
In [26]:
crash.loc['1985'].describe()
Out[26]:
In [27]:
crash.loc['1985'].sort_values('Dead', ascending=False).head(10)
Out[27]:
We can add some annotations to the plot to describe them:
In [28]:
# 'seaborn-v0_8' is the Matplotlib 3.6+ name for the former 'seaborn' style
with plt.style.context('seaborn-v0_8'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # Plot the Dead column
    ax.plot(yearly_crash['Dead'])
    # Add a title
    ax.set_title('Deaths in plane crashes by year')
    # Set limits for the axis
    ax.set_xlim(yearly_crash.index.values.min(), yearly_crash.index.values.max())
    ax.set_ylim(-100, 3200)
    # Increase the number of ticks
    ax.xaxis.set_major_locator(plt.MaxNLocator(20))
    # Add some annotations on the peaks (Timestamps match the datetime x axis)
    style = dict(size=10, color='black')
    ax.text(pd.Timestamp('1972-01-01'), 3020, '1972: 105 crashes with a mean of 28 deaths per crash', ha='center', **style)
    ax.text(pd.Timestamp('1985-01-01'), 2700, '1985: 520 deaths in the Mt. Osutaka crash', ha='center', **style);
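The cell above places the labels with `ax.text` at hand-picked coordinates. A possible variation, sketched here on made-up yearly totals, is `ax.annotate`, which can draw an arrow from the label to the data point and lets the peak be located with `idxmax` instead of being typed in.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up yearly totals (not the real data)
yearly = pd.Series(
    [900, 2900, 1200, 2700, 800],
    index=pd.to_datetime(['1970', '1972', '1980', '1985', '1990'], format='%Y'),
)

fig, ax = plt.subplots()
ax.plot(yearly.index, yearly.values)
ax.set_ylim(0, 3500)

# Locate the highest point and point an arrow at it
peak = yearly.idxmax()
ann = ax.annotate(
    f'{peak.year}: {yearly.max()} deaths',
    xy=(peak, yearly.max()),                     # the point being annotated
    xytext=(pd.Timestamp('1978-01-01'), 3200),   # where the label is drawn
    arrowprops=dict(arrowstyle='->'),
    ha='center',
)
```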
4 - We're now going to add in some data from a different source to take a look at the bigger picture in terms of number of passengers flying each year. Head over to the World Bank Webpage and download the .csv version of the data in the link. Clean it up and merge it with your original aircraft accident data above. Call this merged data set data_all.
In [29]:
# Load the data and set country name as the index
passengers = pd.read_csv('API_IS.AIR.PSGR_DS2_en_csv_v2.csv', skiprows=4)
passengers.set_index('Country Name',inplace=True)
passengers.head()
Out[29]:
We can ignore the first three columns and the last one. We also have to transpose the dataset and convert the years to datetime values so that the two DataFrames can be merged:
In [30]:
# Transpose dataset
passengers = passengers.iloc[:, 3:-1].transpose()
# Convert the year labels to datetime and use them as the new index
# (pd.datetime was removed in pandas 2.0, so pd.to_datetime is used instead)
passengers.reset_index(inplace=True)
passengers['index'] = pd.to_datetime(passengers['index'], format='%Y')
passengers.set_index('index', inplace=True)
passengers.head()
Out[30]:
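The reshaping step can be sketched on a tiny made-up table with the same layout: one row per country and one column per year. The column names `A` and `B` are placeholders, not real country names from the World Bank file.

```python
import pandas as pd

# Made-up wide table: one row per country, one column per year
wide = pd.DataFrame(
    {'1970': [1.0, 2.0], '1971': [1.5, 2.5]},
    index=['A', 'B'],
)

# After transposing, the years become the index and the countries the columns
by_year = wide.transpose()

# Turn the year strings into timestamps (January 1st of each year)
by_year.index = pd.to_datetime(by_year.index, format='%Y')
print(by_year)
```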
Now we can sum all the columns to get the total number of passengers in a year and merge the two datasets:
In [31]:
# Sum all the columns to get the total number of passengers per year
total_passengers = pd.DataFrame(passengers.fillna(0).sum(axis=1), columns=['Passengers'])
# Merge datasets using the indexes and selecting only some columns
all_data = pd.merge(total_passengers, yearly_crash.iloc[:,1:], left_index=True, right_index=True, how='inner')
all_data
Out[31]:
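Merging on the index with `how='inner'` keeps only the years present in both tables. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Made-up yearly tables whose date ranges only partly overlap
passengers = pd.DataFrame(
    {'Passengers': [300, 320, 350]},
    index=pd.to_datetime(['1969', '1970', '1971'], format='%Y'),
)
deaths = pd.DataFrame(
    {'Dead': [1500, 1400, 2900]},
    index=pd.to_datetime(['1970', '1971', '1972'], format='%Y'),
)

# Inner merge on the indexes: 1969 and 1972 are dropped
merged = pd.merge(passengers, deaths, left_index=True, right_index=True, how='inner')

# DataFrame.join is a shorthand for the same index-based merge
joined = passengers.join(deaths, how='inner')
print(merged)
```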
The years between 1960 and 1969 don't seem to have passenger data, despite being present in the passengers CSV, so we are going to skip them:
In [32]:
# .copy() makes the slice independent, so adding columns later raises no warning
all_data = all_data.loc['1970-01-01':].copy()
all_data.head()
Out[32]:
5 - Using data_all, create two graphs to visualize how the number of deaths and passengers vary with time, and, as always, make your plots as visually appealing as possible.
In [33]:
# Function for formatting the number of passengers labels in millions unit
def million_formatter(value, pos):
    return str(int(value / 1e6)) + 'M'
In [34]:
# 'seaborn-v0_8' is the Matplotlib 3.6+ name for the former 'seaborn' style
with plt.style.context('seaborn-v0_8'):
    # Make two subplots, the first with number of passengers and the second with number of deaths
    fig, ax = plt.subplots(2, 1, sharex='col', figsize=(16,8))
    ax[0].plot(all_data['Passengers'])
    ax[1].plot(all_data['Dead'])
    # Set the limits of the common x axis
    ax[0].set_xlim(all_data.index.values.min(), all_data.index.values.max())
    # Set the titles for each plot
    ax[0].set_title('Number of passengers by year')
    ax[1].set_title('Deaths in plane crashes by year')
    # Format the labels for the number of passengers
    ax[0].yaxis.set_major_formatter(plt.FuncFormatter(million_formatter));
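`FuncFormatter` wraps a plain function that receives the tick value and its position and returns the label text. The sketch below repeats `million_formatter` so it runs on its own, and calls the formatter directly to show what it produces. The passenger counts are made up.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def million_formatter(value, pos):
    # Same function as in the cell above, repeated to keep the sketch self-contained
    return str(int(value / 1e6)) + 'M'


formatter = FuncFormatter(million_formatter)

fig, ax = plt.subplots()
ax.plot([1970, 1990, 2010], [310e6, 1.0e9, 2.6e9])  # made-up passenger counts
ax.yaxis.set_major_formatter(formatter)

# Calling the formatter by hand shows the labels it would produce
print(formatter(1.5e9, 0))  # 1500M
print(formatter(2.5e5, 0))  # 0M, because int() truncates values below one million
```

The truncation is harmless here because the ticks fall on whole millions, but it is worth keeping in mind if the same formatter is reused on smaller numbers.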
6 - Make a pie chart representing the number of deaths for each decade. Consult the pyplot documentation to play around with the settings a bit.
In [35]:
# Add a new column with the decade
all_data['Decade'] = all_data.index.year // 10 * 10
In [36]:
with plt.style.context('ggplot'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # Plot sum of Dead column grouped by decade
    ax.pie(all_data.groupby('Decade')['Dead'].sum(),
           labels=all_data['Decade'].unique(),  # add labels
           counterclock=False,                  # change the order of the slices to clockwise
           startangle=90,                       # start from the top
           shadow=True,                         # add shadows
           autopct='%1.1f%%',                   # add percentage inside each slice
           explode=(0.1, 0, 0, 0, 0)            # make the first slice pop out a bit (assumes five decades)
          )
    # Set a title
    ax.set_title('Number of deaths by decade');
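The decade column relies on integer division: `year // 10 * 10` drops the last digit of the year. The sketch below applies it to made-up rows and groups by the result. Taking the slice labels from the grouped index, rather than from `unique()`, is a suggestion here and not what the original cell does; it guarantees that labels and values are in the same order.

```python
import pandas as pd

# Made-up yearly totals (not the real data)
toy = pd.DataFrame(
    {'Dead': [100, 200, 50, 70, 30]},
    index=pd.to_datetime(['1970', '1979', '1980', '1989', '2010'], format='%Y'),
)

# 1979 // 10 * 10 == 1970, so both 1970 and 1979 land in the same decade
toy['Decade'] = toy.index.year // 10 * 10

by_decade = toy.groupby('Decade')['Dead'].sum()
labels = by_decade.index.tolist()  # slice labels in the same order as the values
print(by_decade)
```

When passing `explode` to `ax.pie`, its length has to match the number of slices, which here would be `len(by_decade)`.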